Question 1

Question 1 instructions:

There is a game where I continue to flip a coin until I get heads. If the first time I get heads is on the nth coin, then I pay you 2n - 1 dollars. You can assume that this is a fair coin. Create a program to show how much you should pay me to play this game? Please make sure to include your results after running 1,000, 10,000, and 100,000 trials.

Answer

This scenario can be modeled using the negative binomial distribution (here, with one success, the geometric distribution). If n is the number of trials until the first heads, then the expected value of n is given by r / p, where r is the number of heads and p is the probability of heads on any given trial.

In this problem, we are interested in the first occurrence of heads, so we can denote r = 1, and the coin is fair, so the probability of success is p = .5. Therefore, the expected value of n is 1 / .5 = 2.

The payout associated with n trials is 2n - 1, so by linearity of expectation, given r = 1 and p = .5, we expect an average payout of 2 * 2 - 1 = $3. We expect to see this parametric result through simulation as well.
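
For completeness, this expectation follows from the geometric distribution (a standard derivation, added here for clarity rather than part of the original computation):

\[
E[N] = \sum_{n=1}^{\infty} n \, p \, (1-p)^{n-1} = \frac{1}{p} = 2,
\qquad
E[\text{payout}] = E[2N - 1] = 2E[N] - 1 = 3.
\]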

set.seed(1)
library(parallel) # for `mclapply()` and `detectCores()`
# Define a function to run the game
game = function(.) # dot is a place holder, this fn takes no arguments
{
  # Run one iteration of the game to generate a payout.
  n <- 1 # start w/ the first trial
  # Flip the coin; while it comes up (say) tails, increment n and flip again
  while (sample(c(T, F), size = 1)) {
    # Note, since the probability of heads and tails are equal,
    # we don't need to specify if T or F is "heads" or "tails"
    n = n + 1 # record that we will initiate the n+1-th trial next
  }
  return(2*n - 1) # return the payout
}
# Define a function to run the game i times, returning the vector of payouts
payouts = function(i) {
  # Given a number of iterations to run, implement the game and return a vector
  # containing the payouts associated with each iteration.
  # i is numeric: the number of iterations of the game
  # returns a numeric vector of length n containing the payout from each iteration
  require(parallel)
  # for each iteration i, play the game
  # Thus, use mclapply to play the game in parallel i times
  # Use detectCores() to detect how many cores this machine has
  # Since I know I am running this on my personal 4-core machine, I am willing
  # to use all cores. On shared resources, use at most detectCores() - 1 to
  # leave at least one core free
  # Note: use unlist so that the result is a vector, not a list
  # return the i-length vector of payouts
  return(unlist(mclapply(X = 1:i, FUN = game, mc.cores = detectCores())))
}
# First simulation
t0 = Sys.time()
payout1000 = payouts(1000)
(t1000 = Sys.time() - t0) # record the time to run
## Time difference of 0.05900812 secs
# Next simulation
t0 = Sys.time()
payout10000 = payouts(10000)
(t10000 = Sys.time() - t0)
## Time difference of 0.603652 secs
# Final simulation
t0 = Sys.time()
payout100000 = payouts(100000)
(t100000 = Sys.time() - t0)
## Time difference of 1.578739 secs
payouts = list(payout1000, payout10000, payout100000)
# The mean payout for each simulation size is
sapply(payouts, mean)
## [1] 2.9320 3.0362 3.0023
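
As a quick cross-check (a sketch added here, not part of the original run), the same expectation can be approximated directly with R's geometric random number generator:

# rgeom() counts failures before the first success, so the number of flips
# to the first heads is rgeom(..., prob = .5) + 1
mean(2 * (rgeom(1e6, prob = .5) + 1) - 1) # should be close to the theoretical $3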

As a result, I would not pay more than $2.99 to play the game, and playing the game 100,000 times at $2.99 per play is expected to yield a greater cumulative payout than playing 10,000 or 1,000 times.

End of Question 1

Question 2

rm(list=ls())
library(lubridate)
library(knitr)
library(viridis)
library(lme4)
library(lmerTest)
library(kableExtra)
library(MLmetrics)
library(caret)
library(tseries)
library(forecast)
#library(tidymodels)
suppressPackageStartupMessages(library(tidyverse))

Setup

set.seed(1)

Read in the data

df <- read_tsv("../data/Question2.tsv")

Get a sense of data size/shape

print(dim(df))
## [1] 981   3
kbl(head(df), 'html', align = 'r') %>% kable_styling()
Date Cost Revenue
2019-08-20 $4,214.69 $14,349.50
2019-08-21 $2,627.13 $13,463.13
2019-08-22 $3,196.08 $10,297.64
2019-08-23 $1,134.45 $6,296.62
2019-08-24 $3,207.85 $10,291.51
2019-08-25 $7,325.59 $24,733.41

Quality Control

Check for NA and duplicates

df %>%
  summarise(across(everything(), list(zeros = ~sum(is.na(.)))),
            duplicated.days = sum(duplicated(Date))) %>%
  kbl('html') %>% kable_styling()
Date_zeros Cost_zeros Revenue_zeros duplicated.days
0 0 0 1

There is one duplicated day, and there are no NAs

df %>%
  filter(Date == df$Date[duplicated(df$Date)]) %>%
  kbl('html') %>% kable_styling()
Date Cost Revenue
2021-09-03 $5,432.35 $20,120.06
2021-09-03 $5,432.35 $20,120.06

The above shows that the corresponding Cost and Revenue are duplicated for the duplicated date, so just delete one instance

df = df %>% filter(!duplicated(Date))

Data wrangling, data types, re-coding, and feature engineering

  1. We’ll convert Cost and Revenue, which read in as character/string data, to numeric values
  2. For convenience, we’ll extract day, month, and year, as well as “year day” and “month day” from the Date column to aid in visualization and possibly modeling.
  3. Compute a Profit variable, estimated as \(Revenue - Cost\)
df1 = df %>%
  mutate(
    # convert the cost and revenue columns from character strings of the format
    # $ [n-digit dollars] . [two-digit cents]
    across(c(Cost, Revenue), ~{
      dollars = as.numeric(gsub('$|[[:punct:]]', '', substr(., 1, nchar(.) - 2)))
      cents = substr(., nchar(.) - 1, nchar(.))
      as.numeric(paste(dollars, cents, sep = '.'))
    }),
    # From the date column, it may be handy to have the day, the month, year,
    # the day within the year (1 - 365, or 366 in leap years), day w/in month
    day = day(Date), month = month(Date), year = factor(year(Date)), 
    yday = yday(Date), mday=mday(Date), # day w/in year and month
    # For visualization it can be helpful to have just the month-day (no year)
    month_and_day = as.factor(sub('.{5}', '', Date)),
    # Code the variable profit, estimated by Revenue - cost
    #Profit = Revenue - Cost, efficiency = Revenue / Cost)
    Profit = Revenue - Cost)
head(df1) %>%
  kbl('html') %>% kable_styling()
Date Cost Revenue day month year yday mday month_and_day Profit
2019-08-20 4214.69 14349.50 20 8 2019 232 20 08-20 10134.81
2019-08-21 2627.13 13463.13 21 8 2019 233 21 08-21 10836.00
2019-08-22 3196.08 10297.64 22 8 2019 234 22 08-22 7101.56
2019-08-23 1134.45 6296.62 23 8 2019 235 23 08-23 5162.17
2019-08-24 3207.85 10291.51 24 8 2019 236 24 08-24 7083.66
2019-08-25 7325.59 24733.41 25 8 2019 237 25 08-25 17407.82
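
As an aside, a more concise alternative for the currency conversion (a sketch, equivalent in effect under the assumed $x,xxx.xx format) is readr::parse_number(), which strips the dollar sign and thousands separators directly:

# Alternative conversion using readr (loaded with the tidyverse)
df %>% mutate(across(c(Cost, Revenue), readr::parse_number)) %>% head()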

Exploratory and Graphical Data Analysis

View the distribution of our response Revenue

ggplot(df1) +
  geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
  theme_minimal() +
  scale_fill_viridis_d() +
  facet_grid(rows = vars(year))

This variable shows a high degree of positive skew that is pretty consistent across years. Let’s view it on a log scale:

ggplot(df1) +
  geom_density(aes(x = Revenue, fill = year), color='black', alpha = .5, linewidth=.5) +
  theme_minimal() +
  scale_fill_viridis_d() +
  facet_grid(rows = vars(year)) +
  scale_x_log10()

Transforming the x-axis (implicitly visualizing log(Revenue)) reveals a trend across years and also normalizes the data fairly well. The overall distribution of the variable after log-transformation:

df1 %>%
  ggplot +
  geom_density(aes(x = log(Revenue)), 
               fill='skyblue', color='black', alpha = .5, linewidth=.5) +
  theme_minimal() 

Visualize yearly time series of log(Revenue) and log(Profit)

# Format data for plotting
plt.dat = df1 %>%
  mutate(`log(Revenue)` = log(Revenue), `log(Profit)` = log(Profit))
# plot of: Revenue (log transformed)
plt.dat %>% 
  ggplot +
  geom_line(aes(x = yday, y = `log(Revenue)`, color = year, group = year)) +
  # Label just the first day of each month:
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
    labels = month(1:12, label = T)
  )  +
  # set relevant y-axis
  scale_y_continuous(limits = c(8,12.25)) +
  theme_minimal()

# plot of: Profit = Revenue - Cost (log transformed)
plt.dat %>% 
  ggplot +
  geom_line(aes(x = yday, y = `log(Profit)`, color = year, group = year)) +
  # Label just the first day of each month:
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(plt.dat$yday)[(plt.dat$mday == 1)&(plt.dat$year==2021)],
    labels = month(1:12, label = T)
  )  +
  # set relevant y-axis
  scale_y_continuous(limits = c(8,12.25)) +
  theme_minimal()

Conclusions from plot of \(\log(Revenue)\)

  1. Among the years with complete data, 2021 shows the most revenue; more generally, there appears to be a year-over-year increase such that 2019 < 2020 < 2021 < 2022.

  2. Revenue (in log units) appears to increase approximately monotonically from January through the holiday season, such that each month is higher on average than the previous; however, a sharp decline beginning in mid-December breaks this pattern.

  3. 2020 appears to be an outlier relative to the other years. It shows much more variability in the low-frequency trend, with surprisingly high revenue from January to March, followed by a large drop and then a big rebound. The dip and rebound may be due to the start of the COVID-19 pandemic, or to noise.

Conclusions from plot of \(\log(Profit)\)

  1. Profit is tightly correlated with Revenue and is largely redundant with it in this case

Relationships between Cost and other variables

(log-)Linear relationship between Cost and Revenue

ggplot(df1, aes(x = log(Cost), y = log(Revenue))) +
  geom_point() +
  geom_smooth(method = 'lm', formula = 'y~x') +
  facet_wrap(~year,scales = 'fixed') +
  theme_minimal()

This graph shows that despite the long positive skew in the distributions of Cost and Revenue, the relationship between Cost and Revenue is reliable even at extreme values of both. There appears to be a strong correlation between Cost and Revenue, and the relationship between log(Cost) and log(Revenue) appears to be linear. As such, it appears appropriate to model log(Revenue) going forward. Finally, this trend appears consistent across years, although, based on fewer data points, Cost appears to be less effective at generating Revenue in 2022.

Dynamic relationship between Cost and Revenue

df1 %>%
  pivot_longer(cols = c('Revenue', 'Cost'), names_to = 'name', values_to = 'value') %>%
  ggplot +
  geom_line(aes(yday, y = log(value), color = name, group = name)) +
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
    labels = month(1:12, label = T),
  ) +
  scale_y_continuous(limits = c(6.5,12.25)) +
  facet_grid(rows = vars(year)) +
  theme_bw()

Correlation between Cost and Revenue

(ct = cor.test(df1$Cost, df1$Revenue, method = 'spearman'))
## 
##  Spearman's rank correlation rho
## 
## data:  df1$Cost and df1$Revenue
## S = 24072006, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8465433

The dynamic coupling between Cost and Revenue is significant, \(\rho_{spearman}\) = 0.85, p < .0001. Furthermore, Cost appears to predict both short-term fluctuations and long-term trends in Revenue, such that explicit seasonal terms may be redundant.
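
To probe the lead/lag structure behind this claim (an exploratory sketch, not part of the original analysis), the cross-correlation function can be inspected directly:

# Cross-correlation between Cost and Revenue at daily lags;
# large correlations near lag 0 indicate tight same-day coupling
ccf(df1$Cost, df1$Revenue, lag.max = 14)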

Assess the seasonality of Revenue

df1 %>%
  ggplot +
  geom_line(aes(x = Date, y = log(Revenue))) +
  theme_bw()

The plot suggests a yearly seasonal trend that resets in mid-December each year.

Apply the augmented Dickey-Fuller test to assess stationarity of the Revenue series

adf.test(df1$Revenue)
## Warning in adf.test(df1$Revenue): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  df1$Revenue
## Dickey-Fuller = -4.1153, Lag order = 9, p-value = 0.01
## alternative hypothesis: stationary

This suggests the time series is stationary and can be appropriately modeled using an ARIMA-like model (ARIMA or ARIMAX)
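
As a complementary check (a sketch; the KPSS test reverses the roles of the hypotheses, taking stationarity as the null), one could also run:

# KPSS test from the tseries package: null hypothesis is (level) stationarity
kpss.test(df1$Revenue)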

Conclusions from EDA and graphical analysis

  1. Cost and Revenue should be log-transformed because doing so makes them approximately normal, which will enable us to model mean Revenue more reliably
  2. While yearly trends exist (different years are centered around different overall means) and monthly trends within years are fairly consistent (i.e., increasing between Jan and Dec), these trends are likely to be confounded with Cost
  3. Given the strong dynamic correlation between Cost and Revenue, the data should be modeled using time series analysis
  4. The best fitting model might be a seasonal ARIMA (“SARIMA”); however, it is also possible that an ARIMA model with Cost as an exogenous predictor will fit just as well

Engineer data for modeling

Rescale and recode variables as needed

df2 = df1 %>%
  # Log transform Revenue and Cost
  mutate(log.Revenue = log(Revenue),
         log.Cost = log(Cost))
# Split December into two parts, before and after the peak (for that year)
for (y in unique(df2$year)) {
  dec.data = df2 %>% filter(month == 12, year == y)
  if (nrow(dec.data) > 0) { # if this year has december data
    dec.data[-1:-which.max(dec.data$Cost), 'month'] = 13
    df2[(df2$year == y) & (df2$month == 12), 'month'] = dec.data$month
  }
}
# Store month as an ordered factor
df2 = df2 %>% mutate(across(month, ordered))
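
A quick sanity check (added here as a sketch) confirms that the recoded 13th "month" appears only in years that have December data:

# Cross-tabulate the recoded month factor against year
with(df2, table(month, year))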

Convert the data to a timeseries object

df3 = df2 %>%
  # To aid interpretability, keep both raw and log-transformed Revenue and Cost
  select(log.Revenue, log.Cost, Revenue, Cost) %>%
  ts # convert to time series using ts()
head(df3)
## Time Series:
## Start = 1 
## End = 6 
## Frequency = 1 
##   log.Revenue log.Cost  Revenue    Cost
## 1    9.571470 8.346331 14349.50 4214.69
## 2    9.507710 7.873647 13463.13 2627.13
## 3    9.239670 8.069680 10297.64 3196.08
## 4    8.747768 7.033903  6296.62 1134.45
## 5    9.239075 8.073356 10291.51 3207.85
## 6   10.115910 8.899129 24733.41 7325.59

Fit SARIMAX model

Note: the first model I fit used stationary = TRUE, implicitly fitting a stationary ARIMA model, but its residuals were autocorrelated. Allowing differencing (stationary = FALSE) produced a model that held better to the assumption that residuals are independent. Finally, evidence for independence of residuals was strongest when using raw Revenue and Cost rather than their log transformations.
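
For reference, a sketch of the comparison described above (output omitted here); the stationary fit can be re-created and its residual autocorrelation inspected the same way as for the final model:

# Stationary ARIMAX fit for comparison; checkresiduals() runs a Ljung-Box test
mod_stationary <- auto.arima(df3[, "Revenue"], xreg = df3[, "Cost"], stationary = TRUE)
checkresiduals(mod_stationary)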

mod <- auto.arima(df3[, "Revenue"], xreg = df3[, "Cost"], stationary = F)
# Check model fit
mod
## Series: df3[, "Revenue"] 
## Regression with ARIMA(1,1,2) errors 
## 
## Coefficients:
##           ar1     ma1      ma2    xreg
##       -0.6280  0.0417  -0.4760  3.7673
## s.e.   0.1358  0.1291   0.0758  0.0744
## 
## sigma^2 = 32821109:  log likelihood = -9858.98
## AIC=19727.96   AICc=19728.02   BIC=19752.39

The best-fitting model suggests autocorrelation and non-stationarity (one order of differencing), as well as significant explanatory/predictive value of Cost.
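
One informal way to gauge that claim (a sketch; auto.arima does not report p-values directly) is to form approximate z-statistics from the coefficient table:

# Approximate z-statistics: coefficient estimates divided by their standard errors
mod$coef / sqrt(diag(mod$var.coef))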

Visualize fit

plt.dat %>% 
  mutate(predicted = mod$fitted) %>%
  pivot_longer(cols = c('Revenue', 'predicted'), names_to = 'name', values_to = 'value') %>%
  ggplot +
  geom_line(aes(yday, y = log(value), color = name, group = name)) +
  scale_x_continuous(
    name = 'Month',
    breaks = as.numeric(df1$yday)[(df1$mday == 1)&(df1$year==2021)],
    labels = month(1:12, label = T),
  ) +
  scale_y_continuous(limits = c(6.5,12.25)) +
  facet_grid(rows = vars(year)) +
  theme_bw()

The model appears to fit very well with just Cost as a predictor
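
To quantify that impression (a sketch using forecast's built-in summary of training-set errors):

# Training-set error measures (ME, RMSE, MAE, MAPE, etc.) for the fitted model
accuracy(mod)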

Diagnostics

checkresiduals(mod)

## 
##  Ljung-Box test
## 
## data:  Residuals from Regression with ARIMA(1,1,2) errors
## Q* = 11.67, df = 7, p-value = 0.1119
## 
## Model df: 3.   Total lags used: 10

Part 1

On the first day of the week we spend $17,007.92; how would you allocate the remaining budget throughout the week to try to maximize the amount of Revenue that we get for that channel?

Generate new data to reflect different expenditure patterns

budget = 116000 - 17007.92
new_data = tibble(
  uniform = rep(1/6, times = 6),
  increasing = c(.05, .05, .1, .15, .3, .35),
  decreasing = rev(increasing),
  max1 = ifelse(1:6 == 1, .80, (1 - .80)/5),
  max2 = ifelse(1:6 == 2, .80, (1 - .80)/5), # one day takes majority of expenditure
  max3 = ifelse(1:6 == 3, .80, (1 - .80)/5),
  max4 = ifelse(1:6 == 4, .80, (1 - .80)/5),
  max5 = ifelse(1:6 == 5, .80, (1 - .80)/5),
  max6 = ifelse(1:6 == 6, .80, (1 - .80)/5)
)
all(sapply(new_data, sum) == 1) # the allocation proportions should sum to 1
## [1] TRUE
# Different expenditure patterns:
(new_data = new_data %>% mutate(across(everything(), ~.*budget)))
## # A tibble: 6 × 9
##   uniform increasing decreasing   max1   max2   max3   max4   max5   max6
##     <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1  16499.      4950.     34647. 79194.  3960.  3960.  3960.  3960.  3960.
## 2  16499.      4950.     29698.  3960. 79194.  3960.  3960.  3960.  3960.
## 3  16499.      9899.     14849.  3960.  3960. 79194.  3960.  3960.  3960.
## 4  16499.     14849.      9899.  3960.  3960.  3960. 79194.  3960.  3960.
## 5  16499.     29698.      4950.  3960.  3960.  3960.  3960. 79194.  3960.
## 6  16499.     34647.      4950.  3960.  3960.  3960.  3960.  3960. 79194.

Predict revenue for each scenario

Plots show the forecasted predictions, and provide the cumulative revenue for the forecasted week.

predictions = map(new_data, ~{forecast(mod, xreg = .x)})
map(1:length(predictions), ~{
  p = predictions[[.x]]
  cond = names(new_data)[.x]
  revenue = round(sum(as.vector(p$mean)), 2)
  plt.t = paste0('Total Forecasted Revenue from 4/26/2022 to 5/1/2022
                 condition: ', cond, '; revenue: ', revenue)
  autoplot(p) + 
    coord_cartesian(xlim = c(nrow(df3) - 90, nrow(df3) + 6)) +
    ggtitle(plt.t) +
    theme_minimal()
  })
(Nine forecast plots are rendered here, one per expenditure scenario: uniform, increasing, decreasing, and max1 through max6. Each is titled with the scenario name and its total forecasted revenue for 4/26/2022 to 5/1/2022.)

Most profitable day

lapply(predictions[-1:-3], function(x) x$mean) %>%
  as.data.frame %>%
  as.matrix %>%
  diag %>%
  {data.frame(Date = tail(df2$Date,1) + 1:6, Revenue = .)} %>%
  arrange(desc(Revenue)) %>%
  kbl('html') %>% kable_styling()
Date Revenue
2022-04-26 280490.9
2022-04-28 280091.3
2022-04-30 279933.8
2022-05-01 279766.7
2022-04-29 279667.7
2022-04-27 279416.8

The total revenue predictions are all nearly the same across scenarios because the effect size of Cost in the model is small relative to the units of Revenue. However, a closer inspection reveals that, holding all else constant, spending on this channel on 4/26/2022 will lead to the most Revenue.

minimum_cost = 100
while(T) {
  revenue = forecast(mod, xreg = minimum_cost)$mean
  if (revenue > 0) break else minimum_cost = minimum_cost + 10
}
data.frame(minimum_cost = minimum_cost, revenue = revenue)
##   minimum_cost  revenue
## 1         4740 1.837349

In addition, the model predicts negative revenue unless daily Cost reaches at least $4,740. As such, it would be inadvisable to spend any less than $4,740 in the channel on our most advantageous day (4/26/2022).

Multilevel modeling approach

Viewing the four scatterplots of the bivariate relationship between Cost and Revenue within each year reveals that each year is centered around a different baseline. Additionally, the slope of the regression line predicting Revenue from Cost seems to vary by year, suggesting a possible interaction. To account for the historical data obtained for each month, we can also assume each month has a random intercept. These aspects of the data can be modeled using multilevel models, with random effects for various timing variables (e.g., year or month), in addition to the effect of Cost.

mem1 = lmer(log.Revenue ~ (1|month) + log(Cost)*year, data = df2)
mem2 = lmer(log.Revenue ~ (1|year) + log(Cost), data = df2)
summary(mem1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | month) + log(Cost) * year
##    Data: df2
## 
## REML criterion at convergence: 131.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2849 -0.5698  0.0497  0.6316  4.0875 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  month    (Intercept) 0.01661  0.1289  
##  Residual             0.06219  0.2494  
## Number of obs: 980, groups:  month, 13
## 
## Fixed effects:
##                     Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)          2.29297    0.33859 961.70169   6.772 2.21e-11 ***
## log(Cost)            0.92784    0.04180 971.99989  22.196  < 2e-16 ***
## year2020             2.17702    0.38600 971.99229   5.640 2.23e-08 ***
## year2021            -0.41698    0.39446 971.26037  -1.057    0.291    
## year2022             2.53502    0.51455 971.98987   4.927 9.83e-07 ***
## log(Cost):year2020  -0.25612    0.04740 971.61705  -5.403 8.25e-08 ***
## log(Cost):year2021   0.03524    0.04754 971.92870   0.741    0.459    
## log(Cost):year2022  -0.30298    0.06101 971.96645  -4.966 8.06e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) lg(Cs) yr2020 yr2021 yr2022 l(C):2020 l(C):2021
## log(Cost)   -0.992                                                
## year2020    -0.818  0.809                                         
## year2021    -0.803  0.794  0.788                                  
## year2022    -0.640  0.639  0.570  0.580                           
## lg(Cs):2020  0.826 -0.822 -0.997 -0.784 -0.573                    
## lg(Cs):2021  0.829 -0.826 -0.793 -0.996 -0.592  0.795             
## lg(Cs):2022  0.672 -0.676 -0.578 -0.584 -0.995  0.585     0.602
summary(mem2)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: log.Revenue ~ (1 | year) + log(Cost)
##    Data: df2
## 
## REML criterion at convergence: 383.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.4195 -0.6190 -0.0709  0.6385  3.5354 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  year     (Intercept) 0.004903 0.07002 
##  Residual             0.084863 0.29131 
## Number of obs: 980, groups:  year, 4
## 
## Fixed effects:
##              Estimate Std. Error        df t value Pr(>|t|)    
## (Intercept)   3.15923    0.13081 282.33818   24.15   <2e-16 ***
## log(Cost)     0.82103    0.01505 935.79499   54.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr)
## log(Cost) -0.960

While these models do show significant effects of year and month, there is not much evidence for treating these aspects as random effects, because the between-month and between-year variance components are small relative to the residual variance. This suggests that we should stick with the ARIMA model.
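
One way to formalize this comparison (a sketch; the object names below other than mem1 and mem2 are illustrative) is to refit with maximum likelihood and compare information criteria against a fixed-effects-only analogue:

# REML likelihoods are not comparable across different fixed-effects structures,
# so refit both mixed models with ML before comparing AIC
mem1_ml <- update(mem1, REML = FALSE)
mem2_ml <- update(mem2, REML = FALSE)
ols <- lm(log.Revenue ~ log(Cost) * year, data = df2) # fixed effects only
AIC(mem1_ml, mem2_ml, ols)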

Part 1 Answer:

Based on the model forecasts for the week following April 25th, 2022, which days the budget is allocated to matters less than how much is spent on this channel overall. To that end, I would suggest spending whatever is budgeted for this week. There is minor evidence that allocating expenditures to 4/26 and 4/28 is most advantageous, followed by 4/30, 5/1, 4/29, and 4/27; however, the apparent differences in expected revenue across days may be due to noise in the model and indicative of extrapolation error, so this latter suggestion should be treated with caution.

Part 2

How much Revenue do you predict that we will make in that channel for the given week with the given budget?

Part 2 Answer:

Based on the first model, I predict the week will return $17,007.92 + $262,223.33 for the remaining days, totaling $279,231.20.

Next steps and limitations

One important limitation is that these models were not cross-validated. Therefore the predictions they generate might be due to overfitting. Since we have extensive and dense historical data, this may not be a huge concern in terms of our ultimate predictions about Revenue. However, cross-validation can be a useful tool for model comparison. To that extent, we would be able to assess if the predictions from the multilevel models are inflated relative to the SARIMAX model.
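
A minimal sketch of how such a comparison could be done for the ARIMAX model via rolling-origin evaluation (the horizon, origins, and step size below are illustrative choices, not part of the original analysis):

h <- 7 # forecast horizon in days
origins <- seq(700, nrow(df3) - h, by = 28) # illustrative evaluation origins
rmse <- sapply(origins, function(o) {
  fit <- auto.arima(df3[1:o, "Revenue"], xreg = df3[1:o, "Cost"])
  fc <- forecast(fit, xreg = df3[(o + 1):(o + h), "Cost"])
  sqrt(mean((df3[(o + 1):(o + h), "Revenue"] - as.vector(fc$mean))^2))
})
mean(rmse) # average out-of-sample RMSE across origins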

End of question 2

Question 3

rm(list=ls())
library(lubridate)
library(knitr)
library(viridis)
library(lme4)
library(lmerTest)
library(kableExtra)
library(MLmetrics)
library(caret)
library(tseries)
library(forecast)
library(GGally)
library(cluster)
library(parallel)
#library(tidymodels)
suppressPackageStartupMessages(library(tidyverse))
# Custom function to convert monetary notation to floating point / double
dollar_to_numeric = function(.) {
  # Assumes format: $ dollars (any number of digits) . cents (2-digits)
  dollars = as.numeric(gsub('$|[[:punct:]]', '', substr(., 1, nchar(.) - 2)))
  cents = substr(., nchar(.) - 1, nchar(.))
  as.numeric(paste(dollars, cents, sep = '.'))
}

Setup & EDA

set.seed(1)

Read in the data

df <- read_tsv("../data/Question3.tsv")
print(sapply(df, class))
head(df)

Make more convenient names

df = df %>% rename_with(~gsub(' ', '.', .)) %>%
  rename(N.articles = Number.of.Articles.in.Order)

Convert Order Revenue to numeric

df = df %>% mutate(across(Order.Revenue, dollar_to_numeric))

Get a sense of data size/shape

print(setNames(dim(df), c('rows', 'columns')))
##    rows columns 
##   12323       5
print(paste0('Are there NaN? ', anyNA(df)))
## [1] "Are there NaN? FALSE"
psych::describe(df %>% select(where(is.numeric))) %>% kbl('html') %>% kable_styling()
vars n mean sd median trimmed mad min max range skew kurtosis se
Customer.ID 1 12323 15227.4097 1747.661 15078.0 15210.5686 2326.1994 12347.00 18283.00 5936.00 0.0758212 -1.235251 15.743415
Order.ID 2 12323 558655.6037 12842.563 558419.0 558587.7139 16068.4188 536367.00 581586.00 45219.00 0.0463450 -1.138768 115.689375
N.articles 3 12323 297.8672 623.318 167.0 195.5452 152.7078 1.00 15049.00 15048.00 10.3024877 158.897200 5.615022
Order.Revenue 4 12323 515.6503 1080.100 310.8 346.6314 234.4880 0.38 31698.16 31697.78 11.1844052 184.648797 9.729840

How big is the sample of orders and customers

map(df %>% select(Customer.ID, Order.ID), n_distinct)
## $Customer.ID
## [1] 1277
## 
## $Order.ID
## [1] 12323

There are 1,277 customers and a total of 12,323 orders (the number of rows), which also implies there are no duplicated orders.

Distribution of Revenue

ggplot(df) +
  geom_density(aes(x = Order.Revenue)) +
  geom_vline(aes(xintercept = median(Order.Revenue)), linetype=2) +
  theme_minimal()

ggplot(df) +
  geom_density(aes(x = Order.Revenue)) +
  scale_x_log10() +
  geom_vline(aes(xintercept = median(Order.Revenue)), linetype=2) +
  theme_minimal()

The data are roughly normal after log-transforming Revenue, though there is an overabundance of orders with Revenue values near the median. Henceforth, we use log(Revenue) instead of raw Revenue.

Customer behaviors

In order to classify a new customer as a valuable customer, it may be worthwhile to first define “valuable” based on the behaviors of existing customers. To that end, I will engineer variables that summarize each customer’s behavior across their interactions with the company, for example the median Revenue per article, the total number of articles, and the frequency with which they place orders.

beh = df %>% 
  mutate(Order.Revenue = log(Order.Revenue)) %>%
  group_by(Customer.ID) %>%
  summarise(
    total_revenue = sum(Order.Revenue), # sum of all revenue from customer
    n_orders = n(), # number of orders in total
    # how recent was the last order?
    recency = as.numeric((max(df$Order.Date) + 1) - max(Order.Date)),
    # median order revenue, per order
    median.rev_per_order = median(Order.Revenue),
    # median revenue (log units) per article, per order
    rev_per_art_per_order = median(Order.Revenue / N.articles),
    # sum total number of articles ordered
    total.articles = sum(N.articles),
    # the median number of articles per order
    median.art_per_ord = median(N.articles),
    # this variable captures how long the customer's orders span in time
    longevity = as.numeric(max(Order.Date) - min(Order.Date)),
    # mean time (in days) between consecutive orders
    Consistency = as.numeric(mean(diff(Order.Date)))
  )
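
One caveat worth checking (a sketch added here): customers with only a single order have no between-order gaps, so Consistency is undefined (NaN) for them and would need handling before modeling.

# Count customers whose Consistency could not be computed (single-order customers)
sum(is.nan(beh$Consistency))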

# Plot relationships between variables
p = ggpairs(beh %>% select(-Customer.ID))
ggsave(plot=p + theme(text = element_text(size = 8)),
      filename = "pairs.png", units = 'px', width = 2*1400, height = 2*700)

Viewing the correlations and distributions of this multivariate data can help us understand how to model it, and which features might be useful for predicting revenue.

Conclusions:

  1. The number of orders is positively skewed
  2. # of orders and recency show a heavy positive skew
  3. Consistency is one of the strongest correlates of revenue
  4. Skewed variables should be transformed prior to modeling
  5. Median revenue per order is one of the most normally distributed variables, and might make a good target for modeling
  6. Longevity is right-censored
  7. Several variables might be redundant with one another

Modeling

Option 1

  1. Cluster customers based on the above behavioral variables
  2. Place our new customers into a cluster based on existing data
  3. Using cluster means for variables like number of orders, median revenue per order, and total revenue, predict how many orders the customer will make in all, and the total revenue

Find clusters

To run kmeans, the data should be normalized. Since several variables are skewed, and kmeans assigns observations to cluster centroids (means) based on Euclidean distance, the data are best transformed prior to clustering.

Since there are some examples of multicollinearity in the data, and since several variables have been engineered from just a couple of columns, PCA is a sensible choice for both dimension reduction and normalization.

# get the data formatted, log transform skewed variables
cluster_dat = beh %>% select(-Customer.ID, -longevity) %>%
  mutate(across(c(total_revenue, n_orders, recency, rev_per_art_per_order, 
                  total.articles, median.art_per_ord),
                log)) %>%
  as.matrix
  
# Fit pca model
pca = prcomp(cluster_dat, center = T, scale. = T, retx=T)

# Fit the cluster model
cluster_model = kmeans(x = pca$x[,1:3], centers = 4)
# get the distance matrix to assess of cluster fit
D = dist(pca$x[,1:3], 'euc') # distance matrix
mean(silhouette(cluster_model$cluster, D)[, 'sil_width'])
## [1] 0.2758473

A four-cluster solution appears reasonable (average silhouette width ≈ 0.28)
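
To check that choice (a sketch; the range of k and nstart below are illustrative), the average silhouette width can be compared across several candidate numbers of clusters:

# Average silhouette width for k = 2..8 clusters on the first 3 principal components
sil_by_k <- sapply(2:8, function(k) {
  km <- kmeans(pca$x[, 1:3], centers = k, nstart = 25)
  mean(silhouette(km$cluster, D)[, "sil_width"])
})
setNames(round(sil_by_k, 3), paste0("k=", 2:8))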

Read in the test data and process it as the train data was

test = read.table(
  textConnection("
17811 540566 2011-01-10 83 $185.97
17811 541648 2011-01-20 166 $337.70
17811 542149 2011-01-26 120 $108.55
17811 542929 2011-02-02 105 $289.30
17811 543801 2011-02-13 52 $210.70
17811 543839 2011-02-14 30 $103.70
17811 544450 2011-02-20 497 $1,271.55
17811 544642 2011-02-22 50 $139.80
17811 545279 2011-03-01 53 $122.50
17811 545700 2011-03-06 150 $166.45
17811 546524 2011-03-14 45 $110.84
17811 547197 2011-03-21 30 $108.42
17811 549048 2011-04-06 42 $113.10
17811 549830 2011-04-12 57 $117.83
17811 551870 2011-05-04 132 $194.30
17811 553423 2011-05-17 90 $187.42
17811 554362 2011-05-24 43 $150.35
17811 555548 2011-06-05 99 $147.05
17811 558107 2011-06-26 75 $124.76
17811 559671 2011-07-11 55 $114.95
17811 560820 2011-07-21 127 $146.66
17811 563062 2011-08-11 70 $120.42
17811 563843 2011-08-19 57 $108.94
17811 566042 2011-09-08 90 $182.90"), sep = ' '
) %>%
  set_names(names(df)) %>%
  mutate(across(Order.Revenue, dollar_to_numeric),
         across(Order.Date, as.Date))
test.beh = test %>%
  mutate(Order.Revenue = log(Order.Revenue)) %>%
  group_by(Customer.ID) %>%
  summarise(
    total_revenue = sum(Order.Revenue),
    n_orders = n(),
    recency = as.numeric((max(df$Order.Date) + 1) - max(Order.Date)),
    median.rev_per_order = median(Order.Revenue),
    rev_per_art_per_order = median(Order.Revenue / N.articles),
    total.articles = sum(N.articles),
    median.art_per_ord = median(N.articles),
    longevity = as.numeric(max(Order.Date) - min(Order.Date)),
    Consistency = as.numeric(mean(diff(Order.Date)))
  )
test.cluster_dat = test.beh %>% 
  select(-Customer.ID, -longevity) %>%
  mutate(across(c(total_revenue, n_orders, recency, rev_per_art_per_order, 
                  total.articles, median.art_per_ord),
                log)) %>%
  as.matrix

pca.test = predict(pca, newdata = test.cluster_dat)[, 1:3]

Classify test customer

Compute the mean squared error (MSE) between the new customer’s position in PCA space and the centroid of each cluster.

We assign the customer to the cluster for which this MSE is minimized.

centroids = cluster_model$centers
for (i in 1:nrow(centroids)) {
  print(MSE(centroids[i, ], pca.test))
}
## [1] 2.413387
## [1] 7.585298
## [1] 7.3317
## [1] 4.872944

This customer would be classified into cluster 1.

# Get the revenue predictions from the untransformed data
df %>%
  filter(Customer.ID %in% beh$Customer.ID[cluster_model$cluster ==1]) %>%
  group_by(Customer.ID) %>%
  summarise(total_revenue = sum(Order.Revenue)) %>%
  summarise(difference = median(total_revenue) - sum(test$Order.Revenue))
## # A tibble: 1 × 1
##   difference
##        <dbl>
## 1     -2532.

Unfortunately, the predicted Revenue for the customer is already less than what they have generated so far in the test data. The model is likely ill-fitting.

Conclusions

Regression model

A sensible second approach would be to take some of the features extracted from the past order history, and fit regression models

  1. A first model could be fit to predict the number of orders based on other aspects of customer behavior
  2. A second could predict the revenue per order, based on the same predictors
  3. Finally, we could multiply the predicted number of orders (minus the number of orders already placed) by the predicted revenue per order to estimate the remaining revenue from this customer; a minimal sketch follows below
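
A minimal sketch of this regression approach, using the behavioral features engineered above (the predictor choices here are illustrative assumptions, not fitted models from this write-up):

# Model 1: predict (log) number of orders from other behavioral features
fit_orders <- lm(log(n_orders) ~ recency + Consistency + median.art_per_ord, data = beh)
# Model 2: predict median (log) revenue per order from the same features
fit_rev <- lm(median.rev_per_order ~ recency + Consistency + median.art_per_ord, data = beh)
# Back-transform predictions for the new customer and combine
pred_orders <- exp(predict(fit_orders, newdata = test.beh))
pred_rev_per_order <- exp(predict(fit_rev, newdata = test.beh))
# Estimated remaining revenue = (predicted future orders) x (predicted revenue per order)
(pred_orders - nrow(test)) * pred_rev_per_order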
End of question 3

Question 4

Describe a problem you’ve observed in online shopping or online advertising and how you would approach solving this problem, from a data science perspective.

Answer

One problem I’ve observed in online shopping is in knowing what information a customer is absorbing when viewing a webpage with an array of products and information, that ultimately leads them to their next page, destination, or purchase. Attention and perception are tightly coupled with eye gaze location, but this information is unavailable to companies and ad servers. To make matters worse, users may be employing short-term memory to return to pages that have nothing to do with the ones they’ve viewed most recently. Despite these challenges, I propose that semantic network analysis could provide information to accurately predict products that a user is interested in, increasing the ability to predict whether a subsequently served ad will be relevant and clicked.

The basic assumption is that users are not only searching and internally filtering through ads with an internal model of what is relevant and irrelevant; they are also learning about what is available from viewing other products and ads. By tracking the attributes of all ads and products on a page, we do not need explicit knowledge of whether the user clicked an ad to determine whether they looked at it, or even mentally engaged with it. If the subsequent pages they attend to are more semantically related to the content of some ads or products on previous pages (or on multiple of the previous pages visited during shopping), then we can infer that they attended to some of those semantically related ads. Because short-term memory cannot possibly maintain all of this information, it is likely that subsequent pages will be related to previous ones and will sequentially become more relevant to the user’s goals/interests. Finally, at this point, when information has been accumulated about the types of product attributes that are relevant to the user at that time, we can serve an ad that is semantically related, and therefore relevant, with a higher propensity of leading to a click or purchase.

To solve this problem, we can measure the attributes, or semantic content, of ads and products. This can include broad categorical information, such as ‘is a car’, as well as more specific information such as ‘runs on gas’. Building a semantic model then enables us to infer which of the features shared between successively shown ads and products were actually attended to, because those features will remain a common thread throughout the multiple page impressions the user makes during shopping. Therefore, to use this information to infer the relevant goals of a user and to anticipate a useful ad, we need only keep track of, and continually update, a semantic network relating the ads and products shown on consecutive pages across the user’s journey, and compute the similarity (in real time) between the prominent modules in that network and the available ads to serve.
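
To make the scoring step concrete, here is a toy sketch (all attribute vectors below are hypothetical) of ranking candidate ads by cosine similarity to a running semantic profile built from recently viewed pages:

# Cosine similarity between two attribute vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
# Hypothetical attribute vectors (e.g., is_car, runs_on_gas, is_luxury, on_sale)
viewed <- rbind(page1 = c(1, 0, 1, 0.5),
                page2 = c(1, 1, 0, 0.2))
candidate_ads <- rbind(ad_A = c(1, 1, 1, 0),
                       ad_B = c(0, 0, 0, 1))
# Crude stand-in for the prominent module of the user's semantic network
user_profile <- colMeans(viewed)
# Rank candidate ads by similarity to the inferred profile
sort(apply(candidate_ads, 1, cosine, b = user_profile), decreasing = TRUE)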